A MACHINE LEARNING METHOD FOR MALWARE DETECTION
USING FEATURE EXTRACTION FROM EXECUTABLE FILES


Malicious software (malware) is generally an executable program that settles itself in the system, replicates by copying itself, and has a malicious effect. Modern antivirus systems detect malware by its known pattern, so detecting a new virus is quite difficult. Many heuristic techniques are used to detect unknown malware, but they usually consume a lot of system memory and CPU resources. This load can be reduced by training a machine learning model on features collected from Portable Executable (PE) files, which are then used to identify unknown virus patterns. This paper proposes a technique for collecting these features from a PE file.

Keywords: malware, machine learning, heuristics, PE files


Introduction 

Nowadays one of the main problems of information security is identifying new malicious software and threats. Known viruses do not pose a particular danger, since they are easily detected by hash analysis, but the detection of new threats requires advanced heuristic methods. There are several ways to identify such threats:
1. Reveal similarities between malware families by focusing on the biggest malware groups. This approach is usually based on machine learning (ML) algorithms such as Bayesian networks or genetic algorithms. The input file produces several features, which are compared with known virus patterns.

2. Implement a set of algorithms that emulate the decision-making strategy of a human analyst. A human malware analyst can determine that a Windows PE program appears malicious, without actually observing its behavior, by briefly analyzing the file structure and taking a quick look at the disassembly of the file. The analyst would ask the following questions: Is the file structure uncommon? Is it using tricks to fool a human? Is the code obfuscated? Is it using any anti-debugging tricks? If the answer to such questions is “yes”, then a human analyst would suspect that the file is malicious.

3. Analyze a suspected file in a “sandbox”, which requires implementing user-mode and kernel-mode hooks. This approach executes the suspected file in a virtual environment and looks for suspicious activity, so the real behavior of the executable file can be observed [1].

Limitations of the existing approaches 

As described, there are several approaches to detecting unknown malware. However, each of them has its limitations:

1. The first method can cause a large number of false positives and consume a lot of resources, which is acceptable in a malware research lab environment but is not suitable for desktop solutions.

2. The second method is more reliable than any other approach because it involves actually looking at the true runtime behavior. However, it is too complex and consumes a lot of system memory and CPU resources.

3. The third method largely depends on the quality of the corresponding CPU emulator engine and the quality of the emulated operating system APIs. Even if it is effective, it is time-consuming and costly.
The proposed approach 

This paper describes a method for extracting features from PE files to determine whether a file is malicious, together with a performance evaluation of this method. The method was tested on unpacked Windows x86 executable files.

The approach determines what the suspected file is supposed to do by collecting the following features:

1. Common features — features of the file 
itself. 

2. PE structure features — features of PE 
header and Import table. 

3. Code features — features taken from 
disassembler listing. 

4. Behavioral features — most common 
patterns for the malware. 

A virus usually needs to settle itself in the system, which means it has to use functions for accessing and editing the Windows registry. It can also copy itself to system directories, which means it has to use functions for escalating privileges and copying a file. After the malware has settled in the registry and spread itself over the system, it can access a remote host to request a command from it, which is done by calling networking functions.

All these functions are combined in such a way that we can predict the likely behavior of the analyzed file. Since the PE header gives the exact location of the IAT, and the table is relatively small, we can collect the set of function combinations of a suspected file without consuming a lot of OS resources. A neural network trained on the function combinations of malicious and legitimate files can determine the supposed behavior of the suspected file without actually executing it.

Moreover, a PE file contains a lot of data that helps to determine whether the file was edited. We can check several values, such as the sizes of the code and data sections, the offset of the Original Entry Point (OEP), the Relative Virtual Address (RVA) of the PE image, and the RVA and size of the Import Address Table (IAT), the Export Address Table (EAT) and the Resource Table.

These features can show that the file was patched, for example if its OEP was rewritten or if it has a section size characteristic of one particular virus.
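
As a rough sketch of how such header values can be read without executing the file, the following Python snippet parses a few fields of a 32-bit PE optional header using only the standard library. The function name `read_pe_header_fields` and the choice of fields are illustrative assumptions, not the authors' implementation; the offsets follow the documented PE32 header layout.

```python
import struct

def read_pe_header_fields(data: bytes) -> dict:
    """Read a few PE32 header fields from raw file bytes (illustrative sketch)."""
    if data[:2] != b"MZ":
        raise ValueError("not an MZ executable")
    # e_lfanew (offset 0x3C in the DOS header) points to the PE signature.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_offset:pe_offset + 4] != b"PE\0\0":
        raise ValueError("PE signature not found")
    coff = pe_offset + 4   # COFF file header, 20 bytes long
    opt = coff + 20        # the optional header follows the COFF header
    (magic,) = struct.unpack_from("<H", data, opt)
    if magic != 0x10B:
        raise ValueError("only PE32 (x86) images are handled in this sketch")
    (size_of_code,) = struct.unpack_from("<I", data, opt + 4)
    (entry_point,) = struct.unpack_from("<I", data, opt + 16)
    (image_base,) = struct.unpack_from("<I", data, opt + 28)
    (file_alignment,) = struct.unpack_from("<I", data, opt + 36)
    (dll_characteristics,) = struct.unpack_from("<H", data, opt + 70)
    return {
        "SizeOfCode": size_of_code,
        "AddressOfEntryPoint": entry_point,
        "ImageBase": image_base,
        "FileAlignment": file_alignment,
        "DllCharacteristics": dll_characteristics,
    }
```

The function takes the raw bytes of a file (for example, read with `open(path, "rb").read()`) and returns the fields as plain integers, which can be used directly as numerical features.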

Most code and behavioral features are non-numerical. The machine learning model in our approach needs data in numerical form, but we cannot simply encode these categories as integers, because the resulting numbers would imply an order and distance between categories that do not exist. Instead, each such feature is converted into separate binary features that have the value 1 for instances in which the category appeared and the value 0 when it did not. Hence, each categorical feature is converted to a set of binary features, one per category [2].
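
The conversion described above is commonly called one-hot encoding. A minimal sketch follows, using the file type categories named in this paper; the helper name `one_hot` is a hypothetical choice for illustration.

```python
def one_hot(value, categories):
    """Return one binary feature per category: 1 for the matching category, else 0."""
    return [1 if value == category else 0 for category in categories]

# File type categories as listed in the paper.
FILE_TYPES = ["DLL", "console", "GUI", "native"]

print(one_hot("GUI", FILE_TYPES))  # [0, 0, 1, 0]
```

A value that is not in the category list maps to an all-zero vector, which is one simple way of handling unseen categories.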

Common features include three values:
1. File type (DLL, console, GUI, native), a categorical feature.
2. File entropy, which may indicate that the file contains encrypted content.
3. File size.
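
The paper does not state how the entropy is computed. A common choice, assumed here, is the Shannon entropy of the byte distribution, which ranges from 0 (a single repeated byte) to 8 bits per byte (uniformly distributed bytes, typical of encrypted or compressed content).

```python
import math
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    """Shannon entropy of a byte string, in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)  # occurrences of each byte value
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```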

PE structure features include Import table (IAT) features and the values of PE header fields.

To collect the IAT features, a set of 96 Win32 API functions was created. These are functions that are commonly used by malware as well as by antiviruses, such as AdjustTokenPrivileges, CreateRemoteThread, GetProcAddress, VirtualProtectEx, WriteProcessMemory and many others.

Based on statistics from 46000 malicious programs, only 56 functions were kept. These most popular functions are used as categorical features.
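
The exact selection statistic is not specified in the paper. The sketch below assumes the simplest one, the number of samples that import each function, and then builds one binary feature per kept function. Both helper names are hypothetical.

```python
from collections import Counter

def select_top_functions(samples, k):
    """Keep the k API functions imported by the largest number of samples.

    samples: iterable of collections of imported function names, one per file.
    """
    counts = Counter()
    for imports in samples:
        counts.update(set(imports))  # count each function once per sample
    return [name for name, _ in counts.most_common(k)]

def iat_feature_vector(imports, selected):
    """One binary feature per selected function: 1 if the file imports it."""
    present = set(imports)
    return [1 if name in present else 0 for name in selected]
```

With 56 selected functions, every file is described by a 56-element vector of zeros and ones, regardless of how many functions it actually imports.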

Code features include the frequency with which CPU registers and instructions are used. Control flow graph features also belong to the code features: vertex count, edge count, delta max and density.
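
The paper does not define these graph metrics. The following sketch assumes that "delta max" is the maximum vertex degree (incoming plus outgoing edges) and that density is the directed-graph density E / (V * (V - 1)); the adjacency-dictionary input format is also an assumption.

```python
def cfg_features(cfg):
    """Compute simple control flow graph features.

    cfg: dict mapping a basic-block id to a list of its successor block ids.
    """
    vertices = set(cfg)
    for successors in cfg.values():
        vertices.update(successors)
    edges = {(src, dst) for src, successors in cfg.items() for dst in successors}

    # Degree counts both incoming and outgoing edges of a vertex.
    degree = {v: 0 for v in vertices}
    for src, dst in edges:
        degree[src] += 1
        degree[dst] += 1

    v, e = len(vertices), len(edges)
    return {
        "vertex_count": v,
        "edge_count": e,
        "delta_max": max(degree.values(), default=0),
        "density": e / (v * (v - 1)) if v > 1 else 0.0,
    }
```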

The frequency of register use can help to detect various malicious techniques, such as revealing the current virtual address, which was a well-known technique of file viruses. This technique is illustrated in Figure 1.

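call $+5    ; push the address of the next instruction and jump to it
pop eax     ; eax now holds the address of this instruction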
Fig. 1. Current RVA revealing 


When malware uses a register for relative addressing, as in the example above, or for storing the address of a dynamically loaded library, the frequency of use of this register usually grows or falls respectively.

Behavioral features represent popular behavior patterns that are typically used by malware. The features are collected by parsing the disassembler listing. The 20 most popular malicious techniques were selected:

1. Current RVA revealing
2. VirtualAlloc with RWX rights
3. WriteProcessMemory to the current process memory
4. WriteProcessMemory to the remote process memory
5. DLL injection
6. Keylogger routine
7. Registry modification
8. WinInet API usage
9. Obtaining the PEB address
10. Process replacement
11. User-mode APC injection
12. Process hollowing
13. Thread execution hijacking
14. SetWindowsHookEx usage
15. Extra Window Memory Injection (EWMI)
16. Inline hooking
17. Kernel-mode APC hooking
18. SSDT hooking
19. IDT hooking
20. SYSENTER/SYSCALL hook
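
To illustrate how one such behavioral feature might be collected from a disassembler listing, the sketch below looks for the first pattern, a `call $+5` immediately followed by a `pop` into a 32-bit general-purpose register. It assumes a listing with one instruction per line in the notation of Figure 1; real disassembler output differs between tools, and the function name is hypothetical.

```python
import re

# "call $+5": a call to the instruction that immediately follows it.
CALL_NEXT = re.compile(r"^\s*call\s+\$\+5\s*$", re.IGNORECASE)
# "pop <reg>" for the eight 32-bit general-purpose registers.
POP_REG = re.compile(r"^\s*pop\s+(e[a-d]x|e[sd]i|e[sb]p)\s*$", re.IGNORECASE)

def has_current_address_reveal(listing: str) -> int:
    """Return 1 if the listing contains 'call $+5' followed by 'pop <reg>', else 0."""
    lines = [line for line in listing.splitlines() if line.strip()]
    for first, second in zip(lines, lines[1:]):
        if CALL_NEXT.match(first) and POP_REG.match(second):
            return 1
    return 0
```

The result is already a binary feature, so it can be placed directly into the feature vector alongside the other behavioral patterns.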

Features collection 

The basic idea of the Fisher score is to find a subset of features such that, in the data space spanned by the selected features, the distances between data points in different classes are as large as possible and the distances between data points in the same class are as small as possible. The Fisher score computes the difference, in terms of mean and standard deviation, between positive and negative examples relative to a particular feature, and assigns a rank to each feature. The rank of a feature is defined as the ratio of the absolute difference between the means of the positive and negative examples to the sum of the standard deviations of the positive and negative examples for that feature [3-11]. A large rank implies a greater difference between positive and negative examples with respect to that feature, so the feature is more important for separating positive and negative values and is considered relevant. A small rank implies a smaller difference between positive and negative examples, so the feature is less important for separating them [12-15] and is considered irrelevant.


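$$R_i = \frac{\left|\mu_{i,p} - \mu_{i,n}\right|}{\sigma_{i,p} + \sigma_{i,n}} \qquad (1)$$

where $R_i$ is the rank of the $i$-th feature, $\mu_{i,p}$ and $\mu_{i,n}$ are the means of the feature over legitimate and malicious examples respectively, and $\sigma_{i,p}$ and $\sigma_{i,n}$ are the corresponding standard deviations.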
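
The rank defined above can be computed per feature in a few lines. The sketch below uses the population standard deviation, which is an assumption since the paper does not say which estimator was used, and the function name `fisher_rank` is hypothetical.

```python
from statistics import mean, pstdev

def fisher_rank(positive, negative):
    """Rank of one feature: |mean_p - mean_n| / (std_p + std_n).

    positive, negative: values of the feature in the two classes.
    """
    difference = abs(mean(positive) - mean(negative))
    spread = pstdev(positive) + pstdev(negative)
    if spread == 0:
        # Both classes are constant: perfectly separable if the means differ.
        return float("inf") if difference > 0 else 0.0
    return difference / spread
```

Features are then sorted by this rank and only the highest-ranked ones are kept.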

The Fisher Score approach was 
applied to PE header features and Code 
features. Figures 2 and 3 show Fisher ranks 
for these groups of features. 

Fig. 2. 30 code features with highest ranks

[Figure 3: chart of the Fisher score of each PE header field. Fields shown: MajorSubsystemVersion, Subsystem, MajorOperatingSystemVersion, MajorLinkerVersion, ImageBase, FileAlignment, DllCharacteristics, SizeOfHeaders, MinorOperatingSystemVersion, Characteristics, BaseOfCode, BaseOfData, SizeOfStackReserve, MinorLinkerVersion, MinorSubsystemVersion, SectionAlignment, MajorImageVersion, AddressOfEntryPoint, SizeOfStackCommit, SizeOfHeapReserve, SizeOfHeapCommit, CheckSum, SizeOfOptionalHeader, NumberOfRvaAndSizes, Machine, LoaderFlags, SizeOfUninitializedData, SizeOfCode, SizeOfImage, SizeOfInitializedData, MinorImageVersion.]

Fig. 3. Ranks of PE header features


Only 29 of the 119 code features were kept. The number of PE header features was reduced to three: ImageBase, FileAlignment and DllCharacteristics.

Model testing 

The training set consists of 26628 malicious samples and 9115 legitimate samples. The testing set consists of 1410 malicious and 814 legitimate samples.

The model was trained and tested using five machine learning algorithms: 1) Decision Tree (DT); 2) Random Forest (RF); 3) Gradient Boosting (GB); 4) Adaptive Boosting (AB); 5) Naive Bayes classifier (GNB).

The results of the model testing are shown in Table 1.

Table 1. Classification results

Algorithm | T_m  | R_m  | T_l | R_l  | FP    | FN
GNB       | 1410 | 1839 | 814 | 385  | 30.4% | 11.1%
DT        | 1410 | 1085 | 814 | 1139 | 18.6% | 33.2%
RF        | 1410 | 859  | 814 | 1365 | 13.4% | 38.1%
AB        | 1410 | 1593 | 814 | 631  | 19.9% | 11.7%
GB        | 1410 | 1830 | 814 | 394  | 27.1% | 8.2%


T_m and T_l are the numbers of malicious and legitimate samples in the testing set. R_m and R_l are the numbers of samples that were detected as malicious and legitimate respectively. FP and FN are the false-positive and false-negative rates.
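
The quantities in the table can be derived from true and predicted labels as sketched below. The sketch assumes that the false-negative rate is the share of malicious samples classified as legitimate and that the false-positive rate is the share of legitimate samples classified as malicious; the label convention (1 for malicious, 0 for legitimate) and the function name are illustrative.

```python
def classification_summary(y_true, y_pred):
    """Summarize test results. Labels: 1 = malicious, 0 = legitimate."""
    pairs = list(zip(y_true, y_pred))
    t_m = sum(1 for t, _ in pairs if t == 1)   # malicious samples in the test set
    t_l = len(pairs) - t_m                     # legitimate samples in the test set
    r_m = sum(1 for _, p in pairs if p == 1)   # samples detected as malicious
    r_l = len(pairs) - r_m                     # samples detected as legitimate
    false_pos = sum(1 for t, p in pairs if t == 0 and p == 1)
    false_neg = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "T_m": t_m, "R_m": r_m, "T_l": t_l, "R_l": r_l,
        "FP": false_pos / t_l if t_l else 0.0,
        "FN": false_neg / t_m if t_m else 0.0,
    }
```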

The best result was shown by the Gradient Boosting algorithm: only 8.2% of malicious samples were classified as legitimate.

Conclusion 

In this paper, a feature selection approach for PE files was described. The approach includes the selection of static features (such as PE header field values and file entropy) and behavioral features (popular malicious patterns).

The described approach was applied to a set of malicious and legitimate samples to demonstrate its efficiency.

Model optimization and classifier improvements are left for future work.

SUMMARY

A.A. Boponos, .B. Copoxospuk 

A machine learning method for malware detection using feature extraction from executable files

This paper describes a method for extracting various features from executable files in order to train a machine learning model for detecting malicious software.

Nowadays one of the main problems of information security is the identification of new kinds of malicious software. While already known malware poses no particular danger, since it is easily detected by signature analysis, more advanced heuristic methods are used to detect new, previously unseen threats.

The many heuristic methods used by various antivirus products to detect malicious software usually consume a significant amount of computer and CPU resources. This load can be reduced by training a neural network that determines virus behavior patterns on the basis of features collected from executable files.
The proposed method consists of extracting static and dynamic features without executing the file and determining the most significant of them using the Fisher criterion. It meets the following requirements:

1. Only the executable file of a program is required to analyze its behavior. The program is not run for dynamic analysis using virtual machines, API loggers or additional software.

2. The file is analyzed on the basis of a large number of features obtained directly from the executable file.

3. Features may relate both to the structure of the executable file and to its behavior.

4. When modeling features, numerical features are preferred over categorical ones.

5. All categorical features are converted into numerical binary features in order to avoid arbitrariness in the indexing of categories.

6. Only unpacked PE files are used for training and testing the model.

The described method is supported by the results of testing five machine learning algorithms: Decision Tree, Random Forest, Gradient Boosting, Adaptive Boosting and the Naive Bayes classifier.